- So far, we have worked with small datasets that have few features (columns).
- Datasets with many more columns are very common. They lead to:
  - Slower training
  - Greater difficulty finding a good solution
- This is the context in which dimensionality reduction is applied.
- Dimensionality reduction comes with some loss of information (e.g., JPEG compression).
Author: Christophe Mehay
Principal Component Analysis (PCA)
- One of the most popular dimensionality reduction techniques.
- It identifies the hyperplane that lies closest to the data and then projects the data onto it.
- PCA lets us:
  - Visualize high-dimensional data that could not otherwise be plotted
  - Reduce dimensionality to speed up training
- PCA requires the features to be on the same scale (if they are not, scale them first).
- PCA identifies the axes of greatest variance (the principal components).
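The points above can be sketched with plain NumPy. This is a minimal illustration, not the scikit-learn implementation: the helper name pca_project and the synthetic two-column data are assumptions made for the example. It centers the data, takes the directions of greatest variance from an SVD, and projects onto them.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its first n_components principal axes (hypothetical helper)."""
    # PCA works on centered data: subtract each column's mean
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each sample on the selected axes
    projected = X_centered @ components.T
    # Variance captured along each selected axis
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return projected, components, explained_variance

# Synthetic data (assumption): two strongly correlated columns, second is about 2x the first
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X_demo = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])

projected, components, explained_variance = pca_project(X_demo, n_components=1)
print(components)          # direction of the first principal axis
print(explained_variance)  # variance along that axis
```

Because the second column is roughly twice the first, the first principal axis should point along a direction close to (1, 2), up to sign.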
Using PCA for visualization
In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer(as_frame=True)  # Return the data as pandas objects
print(cancer_data.target_names)  # 0 = malignant, 1 = benign
['malignant' 'benign']
In [2]:
cancer_data.frame.head()
Out[2]:
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
5 rows × 31 columns
In [3]:
len(cancer_data.feature_names)
Out[3]:
30
In [4]:
df = cancer_data.frame
print(df.info())
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
 30  target                   569 non-null    int64
dtypes: float64(30), int64(1)
memory usage: 137.9 KB
None
Out[4]:
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
5 rows × 31 columns
In [5]:
df.hist(bins=20,figsize=(15,10))
plt.tight_layout()
plt.show()
In [6]:
from sklearn.preprocessing import StandardScaler
X = cancer_data.data
y = cancer_data.target
# Standardize every feature to zero mean and unit variance before applying PCA
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=cancer_data.feature_names)
In [7]:
X.head()
Out[7]:
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | 1.886690 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.243890 | 0.281190 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.942210 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | 1.511870 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955000 | 1.152255 | 0.201391 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.935010 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.539340 | 1.371011 | 1.428493 | -0.009560 | -0.562450 | ... | 1.298575 | -1.466770 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.397100 |
5 rows × 30 columns
In [8]:
# Scatter plots between pairs of columns
colors = {0: 'red', 1: 'green'}
color_map = y.map(colors)
columns = X.columns.tolist()
i = 0
# Caution: removing items from `columns` while iterating over it makes the loop
# skip every other column, so it draws 330 of the 435 possible pairs.
for x_col in columns:
    columns.remove(x_col)
    for y_col in columns:  # not named `y`, so the target Series is not overwritten
        print("Plotting ", x_col, " vs ", y_col)
        X.plot.scatter(x=x_col, y=y_col, c=color_map)
        i += 1
print("Number of plots: ", i)
plt.show()
Plotting mean radius vs mean texture
Plotting mean radius vs mean perimeter
...
Plotting mean radius vs worst texture
/home/polivares/anaconda3/envs/DataScience/lib/python3.8/site-packages/pandas/plotting/_matplotlib/core.py:345: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  fig = self.plt.figure(figsize=self.figsize)
Plotting mean radius vs worst perimeter
...
Plotting worst symmetry vs worst fractal dimension
Number of plots: 330
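The loop above removes items from the list it is iterating over, which makes Python skip columns; that is why 330 plots are reported instead of the 30·29/2 = 435 distinct pairs. One way to enumerate every pair exactly once is itertools.combinations. The sketch below uses 30 placeholder column names (an assumption standing in for X.columns) and only counts the pairs rather than plotting them.

```python
from itertools import combinations

# Placeholder names standing in for the 30 feature columns (hypothetical)
feature_names = [f"feature_{k}" for k in range(30)]

# Every unordered pair of distinct columns, each listed exactly once
pairs = list(combinations(feature_names, 2))
print(len(pairs))  # 435

# Each pair could then be plotted, for example:
# for x_col, y_col in pairs:
#     X.plot.scatter(x=x_col, y=y_col, c=color_map)
```

Either way, hundreds of pairwise plots are impractical to inspect, which is the motivation for projecting onto a few principal components instead.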
In [10]:
from sklearn.decomposition import PCA
pca = PCA()
# fit_transform(X) is equivalent to calling pca.fit(X) followed by pca.transform(X)
pcs = pca.fit_transform(X)
print(pcs.shape)
print(X.shape)
(569, 30)
(569, 30)
In [11]:
plt.figure(figsize=(10,6))
plt.scatter(pcs[:,0], pcs[:,1], c=color_map)
plt.title('First two principal components')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
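A 2D projection shows only part of the data. As a rough check, the fraction of variance kept by the two plotted components can be read from explained_variance_ratio_. The sketch below rebuilds the standardized breast cancer features so it runs on its own; the variable names X_scaled, pca_2d and retained are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized feature matrix so the snippet is self-contained
X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Keep only the two components used in the scatter plot
pca_2d = PCA(n_components=2).fit(X_scaled)
retained = pca_2d.explained_variance_ratio_.sum()
print(f"Variance retained by PC1 + PC2: {retained:.1%}")
```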
Using PCA for training
In [12]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = cancer_data.data
y = cancer_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
# Instantiate the standard scaler
scaler = StandardScaler()
# Fit on the training set only, then apply the same transformation to the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [13]:
pca = PCA(n_components=10)  # Try changing the number of components!
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
In [14]:
%%time
# Without PCA
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg.score(X_test, y_test)
CPU times: user 34.7 ms, sys: 172 ms, total: 206 ms
Wall time: 16.3 ms
Out[14]:
0.9824561403508771
In [15]:
%%time
# With PCA
logreg_pca = LogisticRegression()
logreg_pca.fit(X_train_pca, y_train)
logreg_pca.score(X_test_pca, y_test)
CPU times: user 5.1 ms, sys: 602 µs, total: 5.7 ms
Wall time: 4.73 ms
Out[15]:
0.9736842105263158
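The scale, reduce, and classify steps above can also be chained in a single scikit-learn Pipeline, so that the scaler and PCA are always fitted on the training split only. This is a sketch under the same settings as the cells above (10 components, same split parameters); the names X_raw, y_raw, model and accuracy are assumptions introduced for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, random_state=42, test_size=0.2, stratify=y_raw
)

# fit() learns the scaling and the components from the training split;
# score() reuses them unchanged on the test split
model = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression())
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
print(accuracy)
```

Keeping the steps in one object also makes it easier to try other values of n_components without repeating the preprocessing code.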
Choosing the number of PCA components
There are several criteria for choosing how many components to keep:
- Elbow criterion based on the variance (use the explained_variance_ratio_ attribute of the fitted PCA object)
In [16]:
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
plt.figure(figsize=(10,6))
plt.plot(pca.explained_variance_ratio_,'bo-')
plt.show()
- Percentage of the variance to retain. This value is passed as n_components (a float between 0 and 1) when the PCA object is created.
  - E.g., to keep the components that together account for 95% of the variance, fit PCA as shown below
In [17]:
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
# n_features_in_ replaces the n_features_ attribute, which newer scikit-learn releases removed
print("The number of components is", pca.n_components_, "out of", pca.n_features_in_)
The number of components is 10 out of 30
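The same count can be derived by hand from the cumulative sum of explained_variance_ratio_, which shows what the 0.95 threshold means. The sketch below fits on the full standardized dataset rather than the training split (an assumption made to keep it self-contained), so the count it prints need not match the one above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Fit with all components, then accumulate the variance ratios
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Position of the first component at which the running total reaches 95%, plus one
n_needed = int(np.argmax(cumulative >= 0.95)) + 1
print(n_needed, cumulative[n_needed - 1])
```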
Example use of PCA: eigenfaces
In [18]:
from sklearn.datasets import fetch_olivetti_faces
faces = fetch_olivetti_faces()
X = faces.data
y = faces.target
fig = plt.figure(figsize=(13,13))
n = 16
# Show the first 16 faces; subplot positions start at 1, image indices at 0
for i in range(1, n + 1):
    ax = fig.add_subplot(4, 4, i)
    ax.imshow(X[i - 1].reshape(64, 64), cmap='gray')
plt.show()
print(X.shape)
(400, 4096)
In [19]:
from sklearn.decomposition import PCA
pca = PCA()
pcs = pca.fit_transform(X)
pca.components_.shape
Out[19]:
(400, 4096)
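Each row of components_ has one value per pixel, so it can be reshaped into an image, and a few components are often enough to rebuild an approximate picture. Because fetch_olivetti_faces downloads its data, the sketch below substitutes the small load_digits dataset bundled with scikit-learn; that substitution and the helper reconstruction_error are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in image dataset (assumption): 8x8 digit images flattened to 64 features
X_img = load_digits().data

def reconstruction_error(n_components):
    """Mean squared pixel error after compressing to n_components and back."""
    pca_k = PCA(n_components=n_components).fit(X_img)
    # Compress to n_components coordinates, then map back to pixel space
    X_back = pca_k.inverse_transform(pca_k.transform(X_img))
    return float(np.mean((X_img - X_back) ** 2))

# More components keep more detail, so the error should shrink
errors = {k: reconstruction_error(k) for k in (2, 8, 32)}
print(errors)
```

With the Olivetti faces, the same idea would apply with 64x64 images, for example by reshaping pca.components_[0] to (64, 64) to view the first eigenface.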
In [20]:
pca.singular_values_
Out[20]:
array([8.67019730e+01, 6.64652557e+01, 5.01551704e+01, 3.97225189e+01,
3.37573624e+01, 3.15687275e+01, 2.76786003e+01, 2.53545341e+01,
2.48624077e+01, 2.29751530e+01, 2.24406242e+01, 2.12985268e+01,
1.98386669e+01, 1.90296707e+01, 1.83174915e+01, 1.75683746e+01,
1.70332050e+01, 1.60456028e+01, 1.54267321e+01, 1.53560791e+01,
1.48501873e+01, 1.39293432e+01, 1.35770054e+01, 1.34108419e+01,
1.31309624e+01, 1.29574947e+01, 1.27358618e+01, 1.25111065e+01,
1.20198154e+01, 1.18014135e+01, 1.12651854e+01, 1.10127869e+01,
1.06893015e+01, 1.02766113e+01, 1.00567503e+01, 9.98840523e+00,
9.81475353e+00, 9.70944691e+00, 9.43751049e+00, 9.30046749e+00,
9.05566788e+00, 8.95432854e+00, 8.78629017e+00, 8.70131207e+00,
8.53743458e+00, 8.45436287e+00, 8.37684822e+00, 8.34203815e+00,
8.12110233e+00, 8.04420757e+00, 7.88286734e+00, 7.77384901e+00,
7.64300728e+00, 7.51582861e+00, 7.48782730e+00, 7.37899303e+00,
7.29514360e+00, 7.19894505e+00, 7.14871931e+00, 7.07121754e+00,
7.00469780e+00, 6.93232012e+00, 6.88031626e+00, 6.82861328e+00,
6.70961332e+00, 6.66213179e+00, 6.57492971e+00, 6.50328207e+00,
6.42863846e+00, 6.37451029e+00, 6.34244680e+00, 6.31260109e+00,
6.25152540e+00, 6.18892288e+00, 6.10028791e+00, 6.02212143e+00,
6.00382137e+00, 5.89770222e+00, 5.88435698e+00, 5.85578203e+00,
5.79922915e+00, 5.74895144e+00, 5.66556644e+00, 5.58420229e+00,
5.54736567e+00, 5.51564598e+00, 5.49353647e+00, 5.42735195e+00,
5.36528921e+00, 5.35552740e+00, 5.30220890e+00, 5.27387190e+00,
5.25750446e+00, 5.22422743e+00, 5.10435677e+00, 5.04701757e+00,
5.01673603e+00, 5.00707054e+00, 4.98338270e+00, 4.93973160e+00,
4.89162683e+00, 4.87898350e+00, 4.83393717e+00, 4.81472349e+00,
4.76173401e+00, 4.73932409e+00, 4.70194387e+00, 4.65496778e+00,
4.61132574e+00, 4.59201193e+00, 4.55551958e+00, 4.54440308e+00,
4.49239588e+00, 4.46295786e+00, 4.43636084e+00, 4.42737246e+00,
4.39764738e+00, 4.32589483e+00, 4.31439400e+00, 4.27803040e+00,
4.25079823e+00, 4.21890879e+00, 4.18245316e+00, 4.15838623e+00,
4.14846373e+00, 4.13530159e+00, 4.12223387e+00, 4.08655691e+00,
4.06819963e+00, 4.04153681e+00, 4.01971722e+00, 3.96746087e+00,
3.94706655e+00, 3.91783357e+00, 3.89643955e+00, 3.88215613e+00,
3.86362410e+00, 3.83839226e+00, 3.82244968e+00, 3.79968786e+00,
3.75180507e+00, 3.74326634e+00, 3.72621131e+00, 3.69316649e+00,
3.66814423e+00, 3.66080093e+00, 3.64558458e+00, 3.63340139e+00,
3.59850621e+00, 3.58630037e+00, 3.57101560e+00, 3.54475188e+00,
3.53729439e+00, 3.51334643e+00, 3.50319576e+00, 3.49361658e+00,
3.46752691e+00, 3.43627882e+00, 3.42637277e+00, 3.41486120e+00,
3.39643526e+00, 3.37961507e+00, 3.36745453e+00, 3.34750700e+00,
3.32401586e+00, 3.30961466e+00, 3.29122925e+00, 3.26569581e+00,
3.25932407e+00, 3.23354721e+00, 3.22916937e+00, 3.21766758e+00,
3.18845677e+00, 3.16987205e+00, 3.14393115e+00, 3.13798332e+00,
3.13632321e+00, 3.10858941e+00, 3.09750390e+00, 3.08093834e+00,
3.05235863e+00, 3.03736448e+00, 3.01557088e+00, 2.99736953e+00,
2.97718096e+00, 2.96695542e+00, 2.93944931e+00, 2.93521857e+00,
2.91714478e+00, 2.91031051e+00, 2.89646721e+00, 2.88030362e+00,
2.86871934e+00, 2.85353565e+00, 2.84549403e+00, 2.82542944e+00,
2.81264138e+00, 2.80306578e+00, 2.78921080e+00, 2.75221419e+00,
2.74593949e+00, 2.72394443e+00, 2.72214007e+00, 2.71539831e+00,
2.70571685e+00, 2.68297577e+00, 2.67742538e+00, 2.67121172e+00,
2.65917253e+00, 2.63410902e+00, 2.62550092e+00, 2.61427116e+00,
2.59991026e+00, 2.57830143e+00, 2.57135749e+00, 2.56946087e+00,
2.53385353e+00, 2.52855587e+00, 2.50926447e+00, 2.50185418e+00,
2.48441792e+00, 2.47901320e+00, 2.47376800e+00, 2.45884299e+00,
2.44685650e+00, 2.44085217e+00, 2.42005348e+00, 2.41766381e+00,
2.40895438e+00, 2.39749312e+00, 2.39310122e+00, 2.37524271e+00,
2.35588694e+00, 2.34345913e+00, 2.32700181e+00, 2.31737208e+00,
2.30046058e+00, 2.29415107e+00, 2.28666425e+00, 2.28088403e+00,
2.25738430e+00, 2.24987793e+00, 2.24142313e+00, 2.22998047e+00,
2.22279263e+00, 2.21069574e+00, 2.19862056e+00, 2.19633174e+00,
2.18271542e+00, 2.16603637e+00, 2.16189456e+00, 2.14149523e+00,
2.13907576e+00, 2.12251067e+00, 2.10614324e+00, 2.10067153e+00,
2.09049845e+00, 2.08798003e+00, 2.08485126e+00, 2.07021046e+00,
2.05914712e+00, 2.05065632e+00, 2.04303241e+00, 2.02948570e+00,
2.01394176e+00, 2.00638318e+00, 2.00006008e+00, 1.99645925e+00,
1.98778737e+00, 1.97274339e+00, 1.96629345e+00, 1.95498490e+00,
1.95113802e+00, 1.94224048e+00, 1.93135357e+00, 1.92090797e+00,
1.91690063e+00, 1.90727794e+00, 1.89598942e+00, 1.88704824e+00,
1.88123226e+00, 1.87268841e+00, 1.86351442e+00, 1.85769570e+00,
1.85189319e+00, 1.84708476e+00, 1.83549905e+00, 1.83239698e+00,
1.81746566e+00, 1.80466330e+00, 1.79198551e+00, 1.78123844e+00,
1.76884556e+00, 1.75707817e+00, 1.75288594e+00, 1.74518907e+00,
1.73888075e+00, 1.73014629e+00, 1.72320902e+00, 1.71455085e+00,
1.71072924e+00, 1.69756186e+00, 1.69096172e+00, 1.68947208e+00,
1.68282592e+00, 1.65408599e+00, 1.65107608e+00, 1.64000714e+00,
1.63345230e+00, 1.62733793e+00, 1.62501168e+00, 1.61390746e+00,
1.61244512e+00, 1.60437238e+00, 1.58399594e+00, 1.58141637e+00,
1.57921565e+00, 1.56690443e+00, 1.56132340e+00, 1.54932296e+00,
1.54774010e+00, 1.54119098e+00, 1.52550340e+00, 1.51659071e+00,
1.50771141e+00, 1.49963534e+00, 1.49366891e+00, 1.48807275e+00,
1.48013783e+00, 1.47570050e+00, 1.47361946e+00, 1.46197701e+00,
1.45815635e+00, 1.44604146e+00, 1.43327284e+00, 1.43002021e+00,
1.42324698e+00, 1.41398966e+00, 1.40300941e+00, 1.39303148e+00,
1.38711798e+00, 1.38159633e+00, 1.37085938e+00, 1.35424531e+00,
1.34968066e+00, 1.34868908e+00, 1.33929455e+00, 1.33017850e+00,
1.31506407e+00, 1.31415212e+00, 1.29742253e+00, 1.29378426e+00,
1.27884734e+00, 1.27379715e+00, 1.26695335e+00, 1.25753713e+00,
1.24858522e+00, 1.24823749e+00, 1.23656499e+00, 1.22576010e+00,
1.22270632e+00, 1.21047401e+00, 1.20913112e+00, 1.20086586e+00,
1.18829727e+00, 1.17658353e+00, 1.16693616e+00, 1.15184510e+00,
1.14749789e+00, 1.14276719e+00, 1.13625574e+00, 1.12693357e+00,
1.11847591e+00, 1.10988426e+00, 1.10522521e+00, 1.09566057e+00,
1.09356129e+00, 1.07322633e+00, 1.06727886e+00, 1.05904663e+00,
1.03996456e+00, 1.03270960e+00, 1.02170205e+00, 1.01863468e+00,
1.00697589e+00, 1.00185668e+00, 9.90437388e-01, 9.79656398e-01,
9.62462962e-01, 9.40755069e-01, 9.24248576e-01, 9.06024814e-01,
8.86753857e-01, 8.70458901e-01, 8.34254324e-01, 8.28883350e-01,
8.22203100e-01, 7.95056880e-01, 7.19473183e-01, 1.97418791e-04],
dtype=float32)
In [21]:
fig = plt.figure(figsize=(13, 13))
n = 16
for i in range(n):
    ax = fig.add_subplot(4, 4, i + 1)
    ax.imshow(pca.components_[i].reshape(64, 64), cmap='gray')
plt.show()
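Why each row of `components_` can be drawn as an image: a principal component has one entry per pixel, so it reshapes back to the image grid. The following is a minimal NumPy sketch of that idea, not the notebook's own computation. It uses hypothetical random 8x8 "images" as stand-in data and obtains the principal axes from an SVD of the centered data, which is expected to match scikit-learn's `components_` only up to sign.

```python
import numpy as np

# Hypothetical stand-in data: 50 random 8x8 "images", flattened to 64 features.
rng = np.random.default_rng(0)
h, w = 8, 8
images = rng.normal(size=(50, h * w))

# PCA via SVD: center the data; the rows of Vt are the principal axes.
mean_image = images.mean(axis=0)
centered = images - mean_image
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

n_components = 16
components = Vt[:n_components]  # shape (16, 64): one row per component

# Each row has one entry per pixel, so it reshapes back into an image.
component_images = components.reshape(n_components, h, w)

# Project one image onto the 16 components and rebuild an approximation of it.
codes = centered[0] @ components.T            # 16 coefficients
reconstruction = mean_image + codes @ components

print(component_images.shape)
print(np.linalg.norm(images[0] - reconstruction))  # reconstruction error
```

Keeping more components lowers the reconstruction error, and keeping all of them recovers the training image exactly.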
Linear Discriminant Analysis (LDA)
- LDA is a classification algorithm that, during training, learns the most discriminative axes between the classes.
- These axes define the projection, so the projected data keeps the classes as far apart as possible.
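For two classes, the "most discriminative axis" can be illustrated with Fisher's criterion: pick the direction that maximizes the squared distance between the projected class means relative to the projected within-class scatter. The sketch below is a NumPy illustration on hypothetical toy data, not scikit-learn's implementation. The helper `fisher_ratio` and the data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: two Gaussian classes in 2-D sharing one covariance.
cov = np.array([[3.0, 2.0], [2.0, 3.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([3.0, 0.0], cov, size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: sum of the scatter matrices of both classes.
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Fisher's direction for two classes: w is proportional to Sw^{-1} (mu1 - mu0).
w = np.linalg.solve(Sw, mu1 - mu0)
w /= np.linalg.norm(w)


def fisher_ratio(direction):
    """Squared gap between projected means over projected within-class scatter."""
    d = direction / np.linalg.norm(direction)
    between = (d @ (mu1 - mu0)) ** 2
    within = d @ Sw @ d
    return between / within


# Compare with the naive direction that simply joins the two class means.
naive = mu1 - mu0
print(fisher_ratio(w), fisher_ratio(naive))
```

The Fisher direction never scores below the naive one, because it accounts for how the points spread inside each class and not only for where the class means lie.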
In [22]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
import pandas as pd
cancer_data = load_breast_cancer(as_frame=True) # Load the data as a DataFrame
print(cancer_data.target_names) # 0 = malignant, 1 = benign
['malignant' 'benign']
In [23]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
X = cancer_data.data
y = cancer_data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
# Instantiate the standard scaler
scaler = StandardScaler()
# Fit the scaler on the training set only, then transform both sets
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [24]:
lda = LinearDiscriminantAnalysis()
X_train_lda = lda.fit_transform(X_train, y_train)
lda.score(X_test, y_test)
Out[24]:
0.956140350877193
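Two details of the cell above may be worth checking: with two classes LDA can produce at most one discriminant axis (`n_classes - 1`), and `score` reports the mean accuracy of `predict`. The following self-contained sketch verifies both on the same dataset. Feature scaling is left out here for brevity, which is a simplification relative to the notebook, so the printed accuracy is not claimed to match the value shown above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.2, stratify=y
)

lda = LinearDiscriminantAnalysis()
X_train_lda = lda.fit_transform(X_train, y_train)

# Two classes leave room for a single discriminant axis (n_classes - 1).
print(X_train_lda.shape)

# score() is the mean accuracy of predict() on the given data.
manual_accuracy = accuracy_score(y_test, lda.predict(X_test))
print(manual_accuracy, lda.score(X_test, y_test))
```

Because the projection has a single column, plotting it needs an artificial second coordinate, such as a constant zero on the vertical axis.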
In [25]:
# Plot the 1-D LDA projection of the training set, colored by class
colors = {0:'red', 1:'green'}
color_map = y_train.map(colors)
plt.figure(figsize=(10,6))
plt.scatter(X_train_lda,np.zeros_like(X_train_lda), c=color_map)
plt.show()
Activity 9
Time to put everything learned so far into practice!
- Study the California Housing dataset included in the Scikit-Learn datasets (link here)
- Build a complete Pipeline that predicts the price of a house.
In [ ]:
# Extra code: centers matplotlib images in the slides.
# Run it before launching the examples.
from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")